A Hybrid Approach to Chinese Word Segmentation around CRFs
Abstract
In this paper, we present a Chinese word segmentation system that consists of four components: basic segmentation, named entity recognition, an error-driven learner, and a new word detector. Basic segmentation and named entity recognition, both implemented with conditional random fields, generate the initial segmentation results. The other two components refine those results. Our system participated in the tests on the open and closed tracks of Peking University (PKU) and Microsoft Research (MSR). The actual evaluation results show that our system performs very well in the MSR open track, MSR closed track, and PKU open track.
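CRF-based segmenters are commonly trained by recasting segmentation as per-character tagging. The abstract does not say which tag set the basic segmentation component uses, so the sketch below assumes the widely used B/M/E/S scheme (begin, middle, end, single). The function names `words_to_tags` and `tags_to_words` are hypothetical and only illustrate the conversion, not the authors' implementation.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to per-character B/M/E/S position tags.

    Assumed tag scheme (not stated in the abstract):
    B = first character of a multi-character word, M = inner character,
    E = last character, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an inconsistent tag sequence
        words.append(current)
    return words


words = ["北京", "大学", "的", "研究生"]
tags = words_to_tags(words)
restored = tags_to_words("".join(words), tags)
```

In a setup like this, a sequence model would predict `tags` from the raw character string, and `tags_to_words` would turn the prediction back into a segmentation that later stages could refine.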
Similar resources
Combining Character-Based and Subsequence-Based Tagging for Chinese Word Segmentation
Chinese word segmentation is the initial step in Chinese information processing. Its performance has been greatly improved by character-based approaches in recent years. These approaches treat Chinese word segmentation as a character-level word-position tagging problem. With the help of powerful sequence tagging models, the character-based method quickly rose as a mainstream tec...
CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data
In this paper, we propose a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have been used for word segmentation before, this is the first time they have been applied to segmentation in the domain of Chinese micro-blogs. Unlike the genres of common articles, micro-blog has gradually become a new literary form with the development o...
Rules-based Chinese Word Segmentation on MicroBlog for CIPS-SIGHAN on CLP2012
In this evaluation, we took part in the Word Segmentation on Chinese MicroBlog task. After analysing the features of MicroBlog text and the results of our original Chinese word segmentation system, we propose four optimization rules to improve the segmentation algorithm on MicroBlog corpora. The optimized segmentation system is based on c...
Chinese Segmentation and New Word Detection using Conditional Random Fields
Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We als...
A Hybrid Markov/Semi-Markov Conditional Random Field for Sequence Segmentation
Markov order-1 conditional random fields (CRFs) and semi-Markov CRFs are two popular models for sequence segmentation and labeling. Both models have advantages in terms of the type of features they most naturally represent. We propose a hybrid model that is capable of representing both types of features, and describe efficient algorithms for its training and inference. We demonstrate that our h...